ABSTRACT

A real-time fault detection and diagnosis capability
is absolutely crucial in the design of large-scale
space systems. Some of the existing AI-based fault
diagnostic techniques, such as expert systems and
qualitative modelling, are frequently ill-suited for
this purpose. Expert systems are often inadequately
structured, difficult to validate, and suffer from
knowledge acquisition bottlenecks. Qualitative
modelling techniques sometimes generate a large number
of failure source alternatives, thus hampering speedy
diagnosis.

In this paper we present a graph-based technique
which is well suited for real-time fault diagnosis,
structured knowledge representation and acquisition,
and testing & validation. A Hierarchical Fault Model
of the system to be diagnosed is developed. At each
level of hierarchy, there exist fault propagation
digraphs denoting causal relations between
failure-modes of subsystems. The edges of such a
digraph are weighted with fault propagation
probabilities and fault propagation time intervals.
Efficient and restartable graph algorithms are used
for speedy on-line identification of failure source
components.

INTRODUCTION 

A very high degree of automation and complexity 
is evident in modern industrial plants and space 
systems. This trend towards fully automatic and 
largely unmanned complex systems necessitates 
the development of a real-time fault detection
and diagnosis capability. Such a capability would
lead to shorter repair times and longer system 
operational times, thus enhancing productivity. 
Signal processing techniques coupled with mod- 
ern sensors are capable of fault detection and 
alarm generation. Advances in computer technology
such as multi-processing allow improvements in
real-time performance. New artificial intelligence
(AI) programming techniques, such as declarative
languages and symbolic processing, are very
efficient for representing and processing the
failure models of systems.

PROBLEM STATEMENT 

A real-time fault diagnostics system has to func- 
tion in an environment where new alarms may 
constantly be generated, due to the propagation 
of failures. To cope with such a time-changing 
scenario the diagnostics system must have the 
following characteristics: 

• Signal Processing, Alarm Generation and 
Failure Source Identification software must 
be as fast as possible. The first two are usu- 
ally standard well-defined and analyzed al- 
gorithms, and hence, virtually all speed im- 
provements have to be achieved in the failure 
source identification phase. 

• The diagnosed results must be updated as 
time elapses and new alarm information is 
received. These results must be accurate but 
need not have a fine resolution. This implies 
that in the early stages of diagnosis a large 
component such as the Gas-Delivery Assembly
can be identified as the fault source. The
resolution of this fault source is further re- 
fined with the passage of time and additional 
alarm information to a unique valve inside 
the Gas-Delivery Assembly. 

• The User-Interface must present the current
status of diagnosis in a comprehensible manner,
reflecting the level and granularity of the
system under diagnosis at which the diagnostics
system is operating.

SURVEY OF OTHER TECHNIQUES 

Rule-Based approaches or Expert Systems (1) 
have been the primary AI technique used for 
fault diagnostics. Expert Systems use IF - THEN 
rules as their knowledge representation structure, 
and an inference engine operating on these rules 
for detecting the source of failure. For diagnos- 
ing large-scale systems, Expert Systems are of- 
ten unsuitable since they cannot be efficiently 
modularized. Large Expert Systems also suffer
from maintenance, testing and validation problems.
Often large Expert Systems remain incomplete
because of knowledge acquisition problems.

Another approach has centered on using quali- 
tative models (2) which are a simulation of faulty 
system behavior. Fault sources are identified by 
comparing incoming data with all possible qual- 
itative simulation models until a match is found. 
This process may generate too many models and 
be too time-intensive for use in real-time fault 
diagnostics. 

A variety of graph-based techniques such 
as fault trees (3), event-trees (4) and cause- 
consequence diagrams (5) have also been used for 
fault diagnostics. An interesting approach pro- 
posed by Narayanan and Vishwanadham com- 
bines hierarchical fault propagation digraphs 
with fault trees (6). It is judged that a graph- 
based technique offers the best hope for real-time 
fault diagnostics. 


GRAPH-BASED APPROACH 

The basic philosophy of our graph-based approach
is based upon Multiple-Aspect modelling.
The system under consideration is hierarchically 
decomposed from many aspects in order to yield 
many different models. A functional decom- 
position leads to the Hierarchical Process Mo- 
del (HPM) and a structural decomposition leads 
to a Hierarchical Physical Component Model 
(HPCM). A Hierarchical Fault Model (HFM) is 
developed in the context of HPM with links to 
the HPCM. 

The technique of hierarchical decomposition is 
widely used during model building for the follow- 
ing reasons: 

1. Design, knowledge acquisition and
knowledge-base maintenance of a large complex
system become structured and easier.

2. Running the same graph algorithms many times
on a smaller number of nodes takes less time than
running them on the entire set of nodes in a
system. For example, it takes longer to run an
O(n^3) algorithm on a graph with 200,000 nodes
than it takes to run the same algorithm 200 times
on a graph with 100 nodes.

3. It is possible to conduct the search for the 
failure source on the HFM in a parallel man- 
ner, thus enabling speedy diagnosis. 

4. In most cases a large granularity component 
assembly can be identified as a failure source 
at an early stage, and then the search need 
only proceed in that component’s part of the 
model. 
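
The arithmetic behind reason 2 can be checked with a short calculation. The sketch below assumes an idealized cost of exactly n^3 steps with a unit constant; the function name cubic_cost is illustrative and does not come from the diagnostic system itself.

```python
def cubic_cost(n: int, runs: int = 1) -> int:
    """Idealized step count for running an O(n^3) algorithm `runs` times on n nodes."""
    return runs * n ** 3


# One run over the whole, undecomposed graph of 200,000 nodes.
flat = cubic_cost(200_000)

# 200 runs, each on a 100-node digraph, as in the example in the text.
hierarchical = cubic_cost(100, runs=200)

# How many times cheaper the decomposed runs are under this cost model.
ratio = flat // hierarchical

print(flat, hierarchical, ratio)
```

Under this cost model the single large run needs 8 x 10^15 steps against 2 x 10^8 for the 200 small runs. Even 2,000 small runs, enough to cover 200,000 nodes in 100-node pieces, would need only 2 x 10^9 steps.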

Hierarchical Process Model 

A process in the HPM can be thought of as a 
functional unit carrying out a specific function in 
the system, by utilizing different physical compo- 
nents. Different processes on the same level may 
interact with each other through shared physical
components. Processes in the HPM can be as- 
sociated with many different components in the 
HPCM as shown in figure 1. In the context of 
each process the following are acquired: 

1. Process Failure-Modes. 

2. Process Alarms and alarm-generators. The 
alarm-generators accept sensor inputs and if 
needed, generate the appropriate alarm. 

3. Alarm Failure-Mode associations. 

4. Failure-Mode Physical Component associa- 
tions. 

Hierarchical Fault Model 

Each process in the HPM also has its fault
model. This model is derived from its
failure-modes and, if present, the failure-modes
of its subprocesses. All these failure-modes form
nodes of a fault propagation digraph, with directed
edges between individual failure-modes signifying a
fault propagation possibility. Each edge in this
graph is weighted with two parameters: a fault
propagation probability and a fault propagation time
interval in terms of a minimum and a maximum. The
fault propagation digraph of a process on level i is
shown in figure 2. The collection of all such fault
propagation digraphs and failure-mode physical
component associations results in the HFM. It is
possible to extract a rough fault propagation digraph
from process physical interactions since most faults
can only propagate along physical connections.
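
One way to hold such a digraph in memory is sketched below. It is a minimal illustration and not the declarative specification language used by the model editors. The class names Edge and FaultDigraph, the failure-mode and component names, and the choice of seconds as the time unit are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Edge:
    target: str         # failure-mode the fault propagates to
    probability: float  # fault propagation probability
    t_min: float        # minimum propagation time (seconds assumed)
    t_max: float        # maximum propagation time (seconds assumed)


@dataclass
class FaultDigraph:
    # failure-mode -> outgoing weighted edges
    edges: Dict[str, List[Edge]] = field(default_factory=dict)
    # failure-mode -> associated physical components
    components: Dict[str, Set[str]] = field(default_factory=dict)

    def add_edge(self, source: str, target: str,
                 probability: float, t_min: float, t_max: float) -> None:
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must lie in [0, 1]")
        if not 0.0 <= t_min <= t_max:
            raise ValueError("need 0 <= t_min <= t_max")
        self.edges.setdefault(source, []).append(
            Edge(target, probability, t_min, t_max))
        self.edges.setdefault(target, [])  # register sink nodes as well

    def successors(self, mode: str) -> List[str]:
        return [e.target for e in self.edges.get(mode, [])]


# Hypothetical three-node digraph for one process.
digraph = FaultDigraph()
digraph.add_edge("valve_stuck", "low_pressure", 0.9, 1.0, 4.0)
digraph.add_edge("low_pressure", "low_flow", 0.7, 0.5, 2.0)
digraph.components["valve_stuck"] = {"gas_delivery_valve"}
```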

Model Building 

The process of model building and specification 
was aided by graphical editing tools developed 
at Vanderbilt University (7). Each model was 
built using its own specific editor and also
had its own declarative specification language.
The output of an editor is the specification of
a model in its declarative language. A sample
output of the fault model editor is shown in fig- 
ure 3. These generated specifications are used 
by special-purpose interpreters for generating the 
run-time environment of the real-time fault diag- 
nostic system. 

DIAGNOSTIC ALGORITHMS 

By running suitable algorithms on a fault propagation
digraph, a failure source process and its
source failure-mode can be found. Since each
failure-mode in each process is also associated
with physical components, the faulty source
components can also be found. This process can be
migrated to lower levels of the process hierarchy in
order to get a better resolution. Hence the failure
source identification process consists of two
algorithms: the Failure Source Process Identification
(FSPI) and the Fault Source Component Identification
(FSCI). An Inter Level Migration (ILM)
process does the task of searching the process
hierarchy for the best resolution of the possible
faulty source component.

Failure Source Process Identification 

The FSPI algorithm gets as input the fault- 
propagation digraph of a process to be diagnosed. 
It also receives all alarms currently ringing within 
that process and its subprocesses. Under most
circumstances, this algorithm can accurately detect
whether a single fault or multiple faults occurred
in the process. On completion, it returns the
possible fault source subprocesses and their fault
source failure-modes. It uses the following
constraints to determine the fault source in case
of a single fault condition:

1. Reachability Constraint : All ringing alarms 
shall be reachable from the detected source 
failure-modes. 

2. Monitor Constraint : No failure-mode with 
a normal alarm shall lie on a path from any 
of the detected source failure-modes to any 
of the failure-modes with a ringing alarm.

FIGURE 2: Fault Propagation Digraph

3. Temporal Constraint : All ringing alarms
shall be individually reachable from the
detected individual source failure-modes
within the time interval computed from
the time intervals found on the shortest path
between each individual alarm and source
failure-mode pair.

4. Consistency Constraint : There shall be
no failure-mode with a ringing alarm whose
reachability time from a source failure-mode
is greater than the maximum reachability
time of a failure-mode with a normal alarm
from that detected source failure-mode.

The algorithm is closed and complete and is thus 
suitable for speedy location of failure source pro- 
cesses. 
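
The first two constraints can be illustrated with a small reachability search. The sketch below is not the FSPI algorithm itself: it covers only the Reachability and Monitor constraints for a single fault, ignores the edge weights, and takes one strict reading of the Monitor constraint (no normal-alarm failure-mode on any path from the source to a ringing alarm). The function names and the example graph are assumptions.

```python
from collections import deque
from typing import Dict, List, Set


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Every failure-mode reachable from `start`, including `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def single_fault_sources(graph: Dict[str, List[str]],
                         ringing: Set[str],
                         normal: Set[str]) -> List[str]:
    """Failure-modes satisfying the Reachability and Monitor constraints."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    reach = {n: reachable(graph, n) for n in nodes}
    sources = []
    for s in nodes:
        # Reachability: every ringing alarm must be reachable from s.
        if not ringing <= reach[s]:
            continue
        # Monitor: no normal-alarm node may sit between s and a ringing alarm.
        blocked = any(reach[n] & ringing for n in normal & reach[s])
        if not blocked:
            sources.append(s)
    return sorted(sources)


# Hypothetical digraph: A -> B -> C and X -> C.
graph = {"A": ["B"], "B": ["C"], "X": ["C"], "C": []}
candidates = single_fault_sources(graph, ringing={"C"}, normal={"B"})
print(candidates)
```

In this example A is rejected because the only path from A to the ringing alarm at C passes through B, whose alarm is normal. The Temporal and Consistency constraints would prune the remaining candidates further using the time intervals on the edges.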

Fault Source Component Identification 

The FSCI algorithm takes as input a list of de- 
tected source failure processes and their source 
failure-modes. In case of a single fault condition 
it returns a union of all physical components asso- 
ciated with the source failure-modes. In case of a 
multiple fault condition it tries to find a common 
component amongst all the source failure-modes. 
If successful it returns that common component,
and if not it returns a union of all associated
components.
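
The selection rule described above reduces to set union and intersection. The following sketch assumes that the components of each source failure-mode are already available as a set. The function name fsci and the component names are illustrative only.

```python
from typing import List, Set


def fsci(source_components: List[Set[str]], multiple_fault: bool) -> Set[str]:
    """Pick faulty components from the component sets of the source failure-modes."""
    if not source_components:
        return set()
    union = set().union(*source_components)
    if not multiple_fault:
        # Single fault: report every component tied to a source failure-mode.
        return union
    # Multiple faults: prefer components common to all source failure-modes.
    common = set.intersection(*source_components)
    return common if common else union


shared = fsci([{"valve_1", "pump"}, {"valve_1", "pipe"}], multiple_fault=True)
disjoint = fsci([{"pump"}, {"pipe"}], multiple_fault=True)
print(shared, disjoint)
```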

Inter Level Migration 

The ILM process detects the highest level of the 
process in which alarms are ringing. It then tries 
to search for a failure source by running the FSPI 
and later the FSCI algorithms on all processes 
in that level. The results are used to guide a 
breadth-first search of all processes present in the 
next lower level. This process continues until the
lowest level of hierarchy is reached. At this point 
the best possible resolution of the failure source 
is achieved. If during this migration an alarm
rings in a higher level than the one currently
being processed, the ILM goes to that higher level
and restarts the diagnosis. At any point in time
the ILM can present its best guess of the failure 
source in any level of process hierarchy. 
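
The level-by-level refinement can be sketched as a breadth-first descent through the process hierarchy. In the sketch below the predicate is_source stands in for running FSPI and FSCI on a process, and the restart on a higher-level alarm is omitted. The function name, the hierarchy and the process names are assumptions made for illustration.

```python
from typing import Callable, Dict, List


def inter_level_migration(children: Dict[str, List[str]],
                          top_level: List[str],
                          is_source: Callable[[str], bool]) -> List[List[str]]:
    """Suspected failure source processes per level, coarsest level first."""
    results: List[List[str]] = []
    frontier = list(top_level)
    while frontier:
        # Diagnose every process on the current level.
        suspects = [p for p in frontier if is_source(p)]
        if not suspects:
            break
        results.append(suspects)
        # Only the subprocesses of suspects are searched on the next level.
        frontier = [c for p in suspects for c in children.get(p, [])]
    return results


# Hypothetical three-level hierarchy.
children = {
    "plant": ["gas_delivery", "cooling"],
    "gas_delivery": ["valve_a", "valve_b"],
}
faulty = {"plant", "gas_delivery", "valve_b"}
levels = inter_level_migration(children, ["plant"], lambda p: p in faulty)
print(levels)
```

Each entry of the result is the best available guess at that level, so a coarse answer such as the Gas-Delivery Assembly is available before the individual valve has been isolated.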

DIAGNOSTIC SYSTEM 
ARCHITECTURE 

The Real-Time Fault Diagnostic System required:

1. The use of a distributed computing architec- 
ture, 

2. Support for a concurrent programming mo- 
del and 

3. Integration of symbolic and numerical com- 
putations. 

The Multigraph Architecture (MGK) (8) has
been used as a generic framework in the
implementation of the diagnostic system. The MGK is
a dataflow-oriented computing system, capable of
allocating computing nodes on a distributed network
consisting of uniprocessor as well as multiprocessor
configurations. The language of these computing
nodes can be Lisp, C or Ada, thus enabling
integration of symbolic and numerical computations.
The MGK supports programming models such as
autonomous communicating objects (9).

The diagnostic system architecture is shown 
in figure 4. A Monitor task handles the job of 
acquiring sensor outputs and alarm-generation. 
The Diagnostic task consists of a diagnostic man- 
ager object, a diagnostic methods object and a 
display manager object. The diagnostic manager 
accepts as input all generated alarms and is in 
charge of conducting the inter-level search for the 
failure source. During this search it may send a 
process to the diagnostic methods object asking 
it to perform either the FSPI algorithm or the 
FSCI algorithm on it. The diagnostic methods 
object performs the requisite algorithm and re- 
ports the result back to the diagnostic manager. 
These results are used by the diagnostic manager
as a guide in its search. As soon as results are
obtained for a level in the hierarchy they are sent
over to the display manager for displaying them
to the user.

FIGURE 4: Diagnostic System Architecture

TESTING AND VALIDATION 

A real-time alarm pattern simulator is used to 
test and validate the real-time fault diagnostics. 
This simulator is automatically derived from the 
HFM. This simulator accepts as input any num- 
ber of failed components and the times at which 
they are supposed to have failed. It then gener- 
ates in real-time the pattern of alarms that would 
ring due to the failed components. These alarms 
serve as the input for the diagnostics system. 
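
The idea of deriving an alarm pattern from the fault model can be sketched as a shortest-path computation over the propagation times. The sketch below is not the simulator described above: it assumes a single fixed delay per edge (for instance the minimum propagation time), ignores propagation probabilities, and computes the schedule in one pass instead of emitting alarms in real time. All names are hypothetical.

```python
import heapq
from typing import Dict, List, Set, Tuple


def alarm_schedule(edges: Dict[str, List[Tuple[str, float]]],
                   failures: Dict[str, float],
                   alarmed: Set[str]) -> List[Tuple[float, str]]:
    """Earliest time at which each alarmed failure-mode is reached."""
    earliest: Dict[str, float] = {}
    heap = [(t, mode) for mode, t in failures.items()]
    heapq.heapify(heap)
    while heap:
        t, mode = heapq.heappop(heap)
        if mode in earliest:
            continue  # already reached at an earlier time
        earliest[mode] = t
        for nxt, delay in edges.get(mode, []):
            if nxt not in earliest:
                heapq.heappush(heap, (t + delay, nxt))
    # Only failure-modes with an alarm attached ring; sort by time.
    return sorted((t, m) for m, t in earliest.items() if m in alarmed)


# Hypothetical model: a stuck valve fails at t = 0 seconds.
edges = {
    "valve_stuck": [("low_pressure", 2.0), ("low_flow", 5.0)],
    "low_pressure": [("low_flow", 1.0)],
}
schedule = alarm_schedule(edges, {"valve_stuck": 0.0},
                          alarmed={"low_pressure", "low_flow"})
print(schedule)
```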

CURRENT STATUS 

A real-time fault diagnostics system for a Cogen- 
erator plant currently exists on an HP 9000/300 
computer. The process model has 5 levels of hier- 
archy and 45 processes. The average number of 
failure-modes per process is about 4 and hence 
the average number of nodes in a fault propagation
digraph is about 10. An alarm pattern
simulator is currently being used to test and
validate the diagnostics system. In a test in which 5
alarms were generated in a span of 20 seconds,
the diagnostics system located the failure source
with the finest possible resolution in 25 seconds.

CONCLUSIONS 

A hierarchical graph-based model appears to be 
the most suitable fault model for real-time fault 
diagnostics. The knowledge acquisition process 
can be automated to a large extent. The graph
algorithms developed are fast enough to be used 
for speedy diagnosis. The system is able to up- 
date itself and restart the diagnostic procedure 
if necessary. The breadth-first search strategy 
enables the system to provide an accurate diag- 
nosis of a coarse resolution that can be refined 
with the passage of time and more alarm information.
Testing and validation of such a system is
easier since the test program can be automatically
generated from the fault model itself.